Background

Basic analytic framework for processing 16S rRNA gene amplicon data. Each R chunk represents a specific analysis. Depending on the makeup of the data, such as sample names, variable names or data subsets some R chunks are optional or will need modification as needed.

The code chunks perform the following functions:

  1. Data and environment initiation
  2. Factor reordering and renaming (optional)
  3. Data assessment
  4. Taxon prevalence estimations and filtering
  5. Data transformation
  6. Subsetting (optional)
  7. Community composition plotting
  8. Alpha diversity analysis
  9. Beta diversity analysis
  10. Constrained Correspondence Analysis (optional)
  11. Differential abundance testing

To be added: Proscutes, ANCOVA

The example data are dada2 resolved sequence variants. Briefly, extracted nucleic acid from stool samples collected from individually caged mice were amplified in triplicate using primers specific for the V4 region using primers 515F/806R. One group of mice (n=30) were treated with ampicillin for 3 days, and the other a vehicle control (n=15).

Data and environment initiation

We will begin by customizing our global settings, activating packages and loading our data into R using the following steps:

  1. Set global knitr options
  2. Set global ggplot2 theme and options
  3. Load libraries
  4. Load data

Set knitr global options

Useful for standardizing how R chunks are handled by knitr. There are quite a few options you can use in this section which can be read about here, however, setting a standard figure height and width as well as an output directory and prefix name for all knitted figures can be useful.

Load libraries

library("tidyverse")
packageVersion("tidyverse")
## [1] '1.1.1'
library("phyloseq")
packageVersion("phyloseq")
## [1] '1.20.0'
library("RColorBrewer")
packageVersion("RColorBrewer")
## [1] '1.1.2'
library("vegan")
packageVersion("vegan")
## [1] '2.4.3'
library("gridExtra")
packageVersion("gridExtra")
## [1] '2.2.1'
library("knitr")
packageVersion("knitr")
## [1] '1.17'
library("DESeq2")
packageVersion("DeSeq2")
## [1] '1.16.1'
library("plotly")
packageVersion("plotly")
## [1] '4.7.1'
library("microbiome")
packageVersion("microbiome")
## [1] '0.99.93'
library("ggpubr")
packageVersion("ggpubr")
## [1] '0.1.5'
library("randomForest")
packageVersion("randomForest")
## [1] '4.6.12'
library("data.table")
packageVersion("data.table")
## [1] '1.10.4'
library("corrplot")
packageVersion("corrplot")
## [1] '0.77'

Set global ggplot2 theme and options. This sets the plotting aesthetics for every ggplot2 for the rest of the document.

Read in your data

The output from a standard dada2 workflow should be an RDS file. In this case the file is called ps0.rds. You may have already merged your mapping file data (sample variables) with the rds file. However, you will likely add or modify this mapping file as you progress, so it is useful to initiate an import/merge of a modifiable mapping file.

# Read in an RDS file containing taxonomic and count information
ps0 <- readRDS("ps0.rds")

# Read in a mapping file containing sample variable information
map <- import_qiime_sample_data("mapping.txt")

# Merge the RDS file with the mapping file
ps0 <- merge_phyloseq(ps0, map)

# Perform a few sanity checks
sample_variables(ps0)
##  [1] "X.SampleID"           "NumberReads"          "BarcodeSequence"     
##  [4] "LinkerPrimerSequence" "Plate"                "Well"                
##  [7] "Name"                 "SampleNo"             "DNAPlateNumber"      
## [10] "Cell"                 "Experiment"           "Microbiota"          
## [13] "Sex"                  "GroupedCage"          "IndividualCage"      
## [16] "DayPostTreatment"     "Treatment"            "Virus"               
## [19] "DayPostInoculation"   "SurvivalStatus"       "DaysToDeath"         
## [22] "SampleCollected"      "SampesProcessed"      "Description"         
## [25] "continuous_var_1"     "continuous_var_2"
ntaxa(ps0)
## [1] 443
rank_names(ps0)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
get_taxa_unique(ps0, "Phylum")
## [1] "p__Bacteroidetes"   "p__Verrucomicrobia" "p__Firmicutes"     
## [4] "p__Tenericutes"     "p__Cyanobacteria"   "p__Proteobacteria" 
## [7] "p__Actinobacteria"  NA                   "p__Deferribacteres"

Factor reordering and renaming (optional)

The default sorting for ggplot2 is alphabetical. So if you want to make a box plot comparing Shannon diversity between wild-type and knockout mice, it will by default always place knockout on the left and wild-type on the right. However, you may wish to switch this so the knock-out is on the right and wild-type on the left.

This can be done on a plot-by-plot basis, however, it is likely that you will want all of your plots to reflect this customization throughout the entire analysis, so it is useful to have an R chunk at the very beginning of your workflow to specify order and label names.

In the example data, most of the analysis will be done comparing the sample variable “treatment” which is either KoolAid or Ampicillin in the mapping file. Due to default ordering, Ampicillin will always appear before Koolaid. We want the control displayed first (on the left of most plots). We also want to use the more formal “Vehicle” to indicate that a “vehicle control” was used. Koolaid is added to the water to encourage mice to drink the antibiotic laden water. This would be indicated in the methods of a manuscript, but the plots should be more formal and indicate that this was a vehicle control. The code chunk below provides examples for reordering and relabeling sample variable data.

# Examine the way the sample data look now
levels(sample_data(ps0)$Treatment)
## [1] "Ampicillin" "Koolaid"
# Reorder it so vehicle appears first
sample_data(ps0)$Treatment <- factor(sample_data(ps0)$Treatment, levels = c("Koolaid", "Ampicillin"))
levels(sample_data(ps0)$Treatment)
## [1] "Koolaid"    "Ampicillin"
# Change Koolaid into Vehicle
sample_data(ps0)$Treatment <- factor(sample_data(ps0)$Treatment, labels = c("Vehicle", "Ampicillin"))

# Check that it worked as you expected
sample_data(ps0)$Treatment
##  [1] Vehicle    Vehicle    Vehicle    Vehicle    Vehicle    Vehicle   
##  [7] Vehicle    Vehicle    Vehicle    Vehicle    Vehicle    Vehicle   
## [13] Vehicle    Vehicle    Vehicle    Ampicillin Ampicillin Ampicillin
## [19] Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin
## [25] Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin
## [31] Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin
## [37] Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin Ampicillin
## [43] Ampicillin Ampicillin Ampicillin
## Levels: Vehicle Ampicillin

ASV summary statistics

Data assessment consists of 2 steps:

  1. Evaluate Resolved Sequence Variant (RSV, formerly referred to as an OTU) summary statistics
  2. Detect and remove outlier samples

Begin by running the following R chunk to produce several summary plots and basic statistics about the RSV’s and samples in your data.

# Create a new data frame of the sorted row sums, a column of sorted values from 1 to the total number of individuals/counts for each RSV and a categorical variable stating these are all RSVs.
readsumsdf = data.frame(nreads = sort(taxa_sums(ps0), TRUE), 
                        sorted = 1:ntaxa(ps0),
                        type = "RSVs")


# Add a column of sample sums (total number of individuals per sample)
readsumsdf = rbind(readsumsdf,
                   data.frame(nreads = sort(sample_sums(ps0), TRUE),
                              sorted = 1:nsamples(ps0),
                              type = "Samples"))

# Make a data frame with a column for the read counts of each sample for histogram production
sample_sum_df <- data.frame(sum = sample_sums(ps0))

# Make plots
# Generates a bar plot with # of reads (y-axis) for each taxa. Sorted from most to least abundant
# Generates a second bar plot with # of reads (y-axis) per sample. Sorted from most to least
p.reads = ggplot(readsumsdf, aes(x = sorted, y = nreads)) +
  geom_bar(stat = "identity") +
  ggtitle("RSV Assessment") +
  scale_y_log10() +
  facet_wrap(~type, scales = "free") +
  ylab("# of Sequences")

# Histogram of the number of Samples (y-axis) at various read depths
p.reads.hist <- ggplot(sample_sum_df, aes(x = sum)) + 
  geom_histogram(color = "black", fill = "firebrick3", binwidth = 150) +
  ggtitle("Distribution of sample sequencing depth") + 
  xlab("Read counts") +
  ylab("# of Samples")

# Final plot, side-by-side
grid.arrange(p.reads, p.reads.hist, ncol = 2)

# Basic summary statistics
summary(sample_sums(ps0))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     368     806    1593   14086   33143   49325

There is an R script in the /scripts directory called plot_violin which encapsulates that entire R chunk into a single command.

The above data assessment is useful for getting an idea of 1) the overall taxonomic distribution of your reads (left plot). This will normally be a “long tail” with some taxa being highly abundant in the data tapering off to taxa with very few reads, 2) probably more valuable than the first plot is how many reads are in each sample (middle plot). Very low read count can be indicative of a failed reaction and 3) a histogram of the number of samples at various “bins” of read depth. Each of these plots will help give an understanding of how your data are structured across taxa and samples and will vary depending on the nature of your samples.

Samples with unexpectedly low number of sequences can be safely removed. This is an intuitive process and should be instructed by your understanding of the samples in your study. For example, if you have 5 samples from stool samples, one would expect to obtain thousands, if not several thousands of RSVs. This may not be the case for other tissues, such as spinal fluid or tissue samples. Similarly, you would not expect thousands of RSV from samples obtained from antibiotic treated organisms. Following antibiotic treatment you may be left with dozens or hundreds of RSVs. So contextual awareness about the biology of your system should guide your decision to remove samples based on RSV number. The basic idea is to remove samples with “unexpected” numbers of RSV.

Importantly, at each stage you should document and justify your decisions. If you are concerned that sample removal will alter the interpretation of your results, you should run your analysis on the full data and the data with the sample(s) removed to see how the decision affects your interpretation.

The above plots provide overall summaries about the number of RSVs found in all of your samples. However, they are not very useful for identifying and removing specific samples. This can be done using the following R chunk.

# Format a data table to combine sample summary data with sample variable data
ss <- sample_sums(ps0)
sd <- as.data.frame(sample_data(ps0)) # useful to coerce the phyloseq object into an R data frame
df <- merge(sd, data.frame("RSVs" = ss), by ="row.names") # merge ss with sd by row names. Rename ss to RSVs in the new data frame

# Plot the data by the treatment variable

y = 50 # Set a threshold for the minimum number of acceptable reads. Can start as a guess
x = "Treatment" # Set the x-axis variable you want to examine
label = "Well" # This is the label you want to overlay on the points that are below threshold y. Should be something sample specific

p.ss.boxplot <- ggplot(df, aes_string(x, y = "RSVs", fill = x)) + # x is what you assigned it above
  geom_boxplot(outlier.colour="NA") +
  scale_y_log10() +
  geom_hline(yintercept = y, lty = 2) + # Draws a dashed line across the threshold you set above as y
  geom_jitter(alpha = 0.5, width = 0.15, size = 2)
  # geom_text(aes_string(x, y="RSVs", label=label), size=3, nudge_x = 0.1) # This labels a subset that fall below threshold variable y and labels them with the label variable
p.ss.boxplot

(In progress) There is an R script in the /scripts directory called plot_violin which encapsulates that entire R chunk into a single command.

The example data does have a couple of samples with fewer than 500 RSVs. However, these come from samples obtained from antibiotic treated mice, so this fits our expectation. There is also a single sample with fewer than 1,000 RSVs in the control data which is extraordinarliy low compared to the other samples. This sample should definetely be considered for removal. When questionable samples arise you should take note of them so if there are samples which behave oddly in downstream analysis you can recall this information and perhaps justify their removal. The following R chunk shows how to identify and remove samples if needed.

# Use the same 'subset' strategy to plot text on only those samples with fewer than y numbers of RSVs to produce a table of samples
low.RSV.samples <- subset(df, RSVs <= 1000) # Set the number to 1,000 to capture that one sample in vehicle control that is lower than 1,000
low.RSV.samples
##     Row.names X.SampleID NumberReads BarcodeSequence LinkerPrimerSequence
## 15 806rcbc347 806rcbc347        1296    CTAGCTATGGAC                   NA
## 17 806rcbc349 806rcbc349         551    ACCTGGGAATAT                   NA
## 18 806rcbc350 806rcbc350        1224    CTCTGCCTAATT                   NA
## 19 806rcbc351 806rcbc351         775    ATATGACCCAGC                   NA
## 20 806rcbc352 806rcbc352         760    CTCTATTCCACC                   NA
## 27 806rcbc359 806rcbc359        1042    AAGTGGCTATCC                   NA
## 29 806rcbc361 806rcbc361         835    TCGATTGGCCGT                   NA
## 30 806rcbc362 806rcbc362         943    GCATTACTGGAC                   NA
## 31 806rcbc363 806rcbc363        1059    TTGGGCCACATA                   NA
## 32 806rcbc364 806rcbc364        1149    CACACAAAGTCA                   NA
## 37 806rcbc369 806rcbc369        1269    GTTCCTCCATTA                   NA
## 38 806rcbc370 806rcbc370        1370    ACCTGTCCTTTC                   NA
## 39 806rcbc371 806rcbc371         783    GTTCACGCCCAA                   NA
## 41 806rcbc373 806rcbc373         971    CATGCCAACATG                   NA
## 43 806rcbc375 806rcbc375         598    CCTACATGAGAC                   NA
## 44 806rcbc376 806rcbc376        1136    TCCGTGGTATAG                   NA
## 45 806rcbc377 806rcbc377        1324    TCTACGGCACGT                   NA
##    Plate Well       Name SampleNo DNAPlateNumber Cell Experiment
## 15     4  E12 806rcbc347       60              1  E12  LBT2017_1
## 17     4   F2 806rcbc349       62              1   F2  LBT2017_1
## 18     4   F3 806rcbc350       63              1   F3  LBT2017_1
## 19     4   F4 806rcbc351       64              1   F4  LBT2017_1
## 20     4   F5 806rcbc352       65              1   F5  LBT2017_1
## 27     4  F12 806rcbc359       72              1  F12  LBT2017_1
## 29     4   G2 806rcbc361       74              1   G2  LBT2017_1
## 30     4   G3 806rcbc362       75              1   G3  LBT2017_1
## 31     4   G4 806rcbc363       76              1   G4  LBT2017_1
## 32     4   G5 806rcbc364       77              1   G5  LBT2017_1
## 37     4  G10 806rcbc369       82              1  G10  LBT2017_1
## 38     4  G11 806rcbc370       83              1  G11  LBT2017_1
## 39     4  G12 806rcbc371       84              1  G12  LBT2017_1
## 41     4   H2 806rcbc373       86              1   H2  LBT2017_1
## 43     4   H4 806rcbc375       88              1   H4  LBT2017_1
## 44     4   H5 806rcbc376       89              1   H5  LBT2017_1
## 45     4   H6 806rcbc377       90              1   H6  LBT2017_1
##      Microbiota   Sex GroupedCage IndividualCage DayPostTreatment
## 15 JAX C57BL6/J FALSE           L             L5                3
## 17 JAX C57BL6/J FALSE           M             M2                3
## 18 JAX C57BL6/J FALSE           M             M3                3
## 19 JAX C57BL6/J FALSE           M             M4                3
## 20 JAX C57BL6/J FALSE           M             M5                3
## 27 JAX C57BL6/J FALSE           O             O2                3
## 29 JAX C57BL6/J FALSE           O             O4                3
## 30 JAX C57BL6/J FALSE           O             O5                3
## 31 JAX C57BL6/J FALSE           P             P1                3
## 32 JAX C57BL6/J FALSE           P             P2                3
## 37 JAX C57BL6/J FALSE           Q             Q2                3
## 38 JAX C57BL6/J FALSE           Q             Q3                3
## 39 JAX C57BL6/J FALSE           Q             Q4                3
## 41 JAX C57BL6/J FALSE           R             R1                3
## 43 JAX C57BL6/J FALSE           R             R3                3
## 44 JAX C57BL6/J FALSE           R             R4                3
## 45 JAX C57BL6/J FALSE           R             R5                3
##     Treatment Virus DayPostInoculation SurvivalStatus DaysToDeath
## 15    Vehicle    na               d_11           Dead          11
## 17 Ampicillin    na               d_11          Alive          na
## 18 Ampicillin    na               d_11           Dead          11
## 19 Ampicillin    na               d_11           Dead           7
## 20 Ampicillin    na               d_11          Alive          na
## 27 Ampicillin    na               d_11          Alive          na
## 29 Ampicillin    na               d_11          Alive          na
## 30 Ampicillin    na               d_11           Dead          11
## 31 Ampicillin    na               d_11           Dead          10
## 32 Ampicillin    na               d_11          Alive          na
## 37 Ampicillin    na               d_11          Alive          na
## 38 Ampicillin    na               d_11           Dead          10
## 39 Ampicillin    na               d_11           Dead          18
## 41 Ampicillin    na               d_11           Dead          11
## 43 Ampicillin    na               d_11          Alive          na
## 44 Ampicillin    na               d_11          Alive          na
## 45 Ampicillin    na               d_11          Alive          na
##    SampleCollected SampesProcessed Description continuous_var_1
## 15              LT              QT  806rcbc347         102.8443
## 17              LT              QT  806rcbc349        1293.2978
## 18              LT              QT  806rcbc350         534.2119
## 19              LT              QT  806rcbc351        3148.8693
## 20              LT              QT  806rcbc352        3433.2437
## 27              LT              QT  806rcbc359        1799.1402
## 29              LT              QT  806rcbc361        1428.9089
## 30              LT              QT  806rcbc362        5582.6534
## 31              LT              QT  806rcbc363        2862.9539
## 32              LT              QT  806rcbc364         438.9060
## 37              LT              QT  806rcbc369        3270.4070
## 38              LT              QT  806rcbc370        4251.9886
## 39              LT              QT  806rcbc371        1511.8590
## 41              LT              QT  806rcbc373        2464.2605
## 43              LT              QT  806rcbc375         513.6356
## 44              LT              QT  806rcbc376        1178.5029
## 45              LT              QT  806rcbc377        8920.0735
##    continuous_var_2 RSVs
## 15         2.730802  921
## 17         3.701808  368
## 18         1.938974  786
## 19         2.695727  553
## 20         2.212551  512
## 27         2.444594  722
## 29         2.113268  598
## 30         2.420489  673
## 31         2.695764  747
## 32         2.161772  806
## 37         2.145264  898
## 38         2.244468  947
## 39         2.872685  557
## 41         3.137555  682
## 43         1.993803  413
## 44         2.226764  809
## 45         2.789934  855
# You can use this data to find a unique identifier to remove those specific samples
ps0 # View the original number of samples
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 443 taxa and 45 samples ]
## sample_data() Sample Data:       [ 45 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 443 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 443 tips and 441 internal nodes ]
ps1 <- ps0 %>%
  subset_samples(
    Name != "806rcbc347"
  )
ps1 # The sample number should be reduced according to your expectations
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 443 taxa and 44 samples ]
## sample_data() Sample Data:       [ 44 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 443 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 443 tips and 441 internal nodes ]

Overall sample relationship to evaluate sample outliers

Note that we created a new phyloseq object called ps1. This preserves all of the data in the original ps0 and creates a new data object with the offending sample(s) removed called ps1.

Failure to detect and remove “bad” samples can make interpreting ordinations much more challenging as they typically project as “outliers” severely skewing the rest of the samples. These samples also increase variance and will impede your ability to identify differentially abundant taxa between groups. So sample outlier removal should be a serious and thoughtful part of every analysis in order to obtain optimal results.

The next code chunk implements an MDS plot of Bray-Curtis dissimilarity. This is a simple projection of multivariant data and can be useful for identifying sample outliers similar to what we just did above. However, this takes into consideration the entire properties of the data set, and not just number of RSVs. If outliers are suspected based on this plot one should consider their removal.

# Outlier evaluation
out.bray <- ordinate(ps1, method = "MDS", distance = "bray")
p.MDS.outlier <- plot_ordination(ps1, out.bray, color = "Treatment", axes = c(1,2)) +
  theme_bw() +
  geom_point(size = 2) +
  ggtitle("MDS of Bray Distances \nOutlier Evaluation") +
  geom_text(aes(label = Well), size = 3, check_overlap = FALSE, vjust = -1)
p.MDS.outlier

## Outlier sample removal

You can see when we color code the points by treatment that there are three samples that are intermediate between Vehicle and Ampicillin treatment (labelled G1, H1, and H3), and one Ampicillin sample behaving like a Vehicle treated sample (sample F1). The samples were intentionally labelled by the sample well from the 96-well plate that was used in library creation. It is very common for the edge wells (rows A or H and columns 1 and 12) to have a higher failure rate due to evaporation during thermal cycling. Overlaying the well coordinates as text on the ordination plot shows that the samples exhibiting unexpected behavior were amplified in edge wells, except the H3 sample. Knowing the well position along with the unexpected behavior allows us to make the safe decision that these samples are likely unrepresentative of the experiment and can be justifiably documented and removed.

If you recall the violin plot generated to detect sample outliers above, there were also four samples from the Ampicillin treated samples with unexpectedly high numbers of RSVs. We did not consider those earlier, but this MDS plot suggests that there are samples from Ampicillin treated mice behaving like samples from Vehicle control mice. If you redraw the violin plot but overlay “Well” as the text variable for all samples and not a subset, you will see that samples G1, H1, H3 and in particular the sample in F1 have RSV numbers similar to the Vehicle control mice.

The fact that three of these came from edge wells is a strong argument for their removal. The sample in H3 is a more challenging decision. These samples could just be representative of the natural biological variability in the effectiveness of Ampicillin treatment in mice. Perhaps the antibiotics just didnt work as well in the mouse used in well H3.

Sample removal decisions should be made thoughtufully and considering biological and technical context. Again, the appropirate way to handle this is to document and test the affect of the removal. For this example, I am choosing to remove all of the outlier samples as indicated by the MDS and RSV violin plots (samples in well G1, H1, F1 as well as the curious sample in H3). I am doing so because I want to also exclude variance due to failed Ampicillin treatment which these plots are in support of.

# Note: DO NOT remove based on well number. Each run uses several 96-well plates, so removal by well number will potential remove serveral samples
# Always use a unique ID for sample removal

ps1 # View the original number of samples
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 443 taxa and 44 samples ]
## sample_data() Sample Data:       [ 44 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 443 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 443 tips and 441 internal nodes ]
ps2 <- ps1 %>%
  subset_samples(
    Name != "806rcbc348" &
    Name != "806rcbc360" &
    Name != "806rcbc374" &
    Name != "806rcbc372"
  )
ps2 # The sample number should be reduced according to your expectations
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 443 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 443 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 443 tips and 441 internal nodes ]
# You can redraw the plot to confirm removal
out.bray.removed <- ordinate(ps2, method = "MDS", distance = "bray")
p.MDS.outlier.removed <- plot_ordination(ps2, out.bray.removed, color = "Treatment") +
  theme_bw() +
  geom_point(size = 2) +
  ggtitle("MDS of Bray Distances \nOutliers Removed")
p.MDS.outlier.removed

##Taxon prevalence estimations and filtering

Low abundant taxa typically do not contribute to ecological community evaluation or differential abundnace testing. There are of course caveats to this statement (i.e. low-abundance pathogen detction), but many analysis can benefit from the removal of uninformative (low prevelance) taxa. Removal of low prevelance taxa greatly assist in tests penalized with a false-discovery-rate (FDR) calculation. Similar to outlier sample removal, low prevelant taxa removal should be justified and documented. The following R chunk provides several evaluations and plots to assist with this decision.

# Begin by removing sequences that were not classified as Bacteria or were classified as either mitochondria or chlorplast

ps2 # Check the number of taxa prior to removal
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 443 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 443 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 443 tips and 441 internal nodes ]
ps2 <- ps2 %>%
  subset_taxa(
    Kingdom == "k__Bacteria" &
    Family  != "f__mitochondria" &
    Class   != "c__Chloroplast"
  )
ps2 # Confirm that the taxa were removed
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 415 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 415 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 415 tips and 413 internal nodes ]
# Note the prefix (e.g. k__, f__, c__, etc.) is a GreenGenes convention. If you are using RDP or Silva you can simply remove these prefixes.

# To remove RSVs lacking phyla classification
#ps2 <- subset_taxa(ps2, !is.na(Phylum) & !Phylum %in% c("","uncharacterized"))

# Prevelance estimation
# Calculate feature prevelance across the data set
prevdf <- apply(X = otu_table(ps2),MARGIN = ifelse(taxa_are_rows(ps2), yes = 1, no = 2),FUN = function(x){sum(x > 0)})

# Add taxonomy and total read counts to prevdf
prevdf <- data.frame(Prevalence = prevdf, TotalAbundance = taxa_sums(ps2), tax_table(ps2))

# Create a table of Phylum, their mean abundances across all samples, and the number of samples they were detected in
plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})
##               Phylum        1    2
## 1  p__Actinobacteria 15.10000  151
## 2   p__Bacteroidetes 16.23404  763
## 3   p__Cyanobacteria  3.50000    7
## 4 p__Deferribacteres  8.00000    8
## 5      p__Firmicutes 14.43519 4677
## 6  p__Proteobacteria 14.84615  193
## 7     p__Tenericutes  9.87500  158
## 8 p__Verrucomicrobia 23.00000   46
#Prevalence plot
prevdf1 <- subset(prevdf, Phylum %in% get_taxa_unique(ps0, "Phylum"))
p.prevdf1 <- ggplot(prevdf1, aes(TotalAbundance, Prevalence / nsamples(ps2),color=Family)) +
  geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) +
  geom_point(size = 3, alpha = 0.7) +
  scale_x_log10() +
  xlab("Total Abundance") + ylab("Prevalence [Frac. Samples]") +
  facet_wrap(~Phylum) +
  theme(legend.position="none") +
  ggtitle("Phylum Prevelence in All Samples\nColored by Family")
p.prevdf1

The code below can be reproduced using the script calleed plot_prevalance.R in the scripts folder.

This code will produce a table and a plot of all of the Phyla present in your samples along with information about their prevelance (fraction of samples they are present in) and total abundance across all samples. For this data, you can see from the table that the Cyanobacteria and Deferribacteres have both low prevelance and overall abundance. This is reflected in the plot which displays all of the unique taxa within each family. The points are colored by family, so you can see that there is only one type of Cyanobacteria and Deferribacteres, but lots of Firmicutes. There is also a ubiqutous Tenericute in the data along with another more randomly distributed Tenericute.

The plot also shows a dashed line that was subjectively chosen to cross at 5% prevelance. Everything below this line is present in fewer than 5% of all of the samples in the study.

Both the table and the plots can be used to remove low prevelant taxa. This can be done by either choosing to remove specific taxa (e.g. Deferribacteres or Cyanobacteria from the example data), or remove all taxa above or below a given threshold (e.g. below 5% prevelance). If you choose to do this you should justify and document. You should also consider analyzing both data (data with all taxa vs. data with low-prevelant taxa removed) to observe the effects of taxa removal. This will be useful for understanding how low-prevelant taxa removal influences your results. Low-prevalence filtering is a common practice precedding differential abudnace testing. Low prevelant taxa create harsh penalties on statistical tests due to multiple testing penalties and it is common practice to remove them. The R chunk below describes how to perform taxon filtering.

For the sake of example, we will go ahead and remove the low-prevalence Deferribacteres and Cyanobacteria. There is no biological basis behind this, and it would actually be detrimental if you were interested in the biology of either type of organism. But statistically they are not going to provide much information about community diversity across samples. On the other hand, they probably will not cause any problems either, so would normally be safe to leave them in. We will remove them just as an example. We will also make a new phyloseq object with taxa < 5% prevalence removed called ps3.prev.

# Remove specific taxa
# Define a variable with taxa to remove
filterPhyla = c("p__Deferribacteres", "p__Cyanobacteria")

ps2 # Check the number of taxa prior to removal
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 415 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 415 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 415 tips and 413 internal nodes ]
ps3 <- subset_taxa(ps2, !Phylum %in% filterPhyla) 
ps3 # Confirm the taxa were removed
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 412 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 412 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 412 tips and 410 internal nodes ]
# Removing taxa that fall below 5% prevelance
# Define the prevalence threshold
prevalenceThreshold = 0.05 * nsamples(ps3)
prevalenceThreshold
## [1] 2
# Define which taxa fall within the prevalence threshold
keepTaxa <- rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
ps3 # Check the number of taxa prior to removal
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 412 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 412 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 412 tips and 410 internal nodes ]
ps3.prev <- prune_taxa(keepTaxa, ps3)
ps3.prev # Confirm the taxa were removed
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 402 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 402 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 402 tips and 400 internal nodes ]

Data transformation

Many analysis in community ecology and hypothesis testing benefit from data transformation. Many microbiome data sets do not fit to a normal distribution, but transforming them towards normality may enable more appropriate data for specific statistical tests. The choice of transformation is not straight forward. There is literature on how frequently used transfromations affect certain analysis, but every data set may require different considerations. Therefore, it is recommended that you examine the effects of several transformations on your data and explore how they alter your results and interpretation.

The R chunk below implements several commonly used transformations in microbiome research and plots their results. Similar to outlier removal and prevalance filtering, your choice should be justified, tested and documented.

# Transform to Realative abundances
ps3.ra <- transform_sample_counts(ps3, function(OTU) OTU/sum(OTU))

# Transform to Proportional Abundance
ps3.prop <- transform_sample_counts(ps3, function(x) min(sample_sums(ps3)) * x/sum(x))

# Log transformation moves to a more normal distribution
ps3.log <- transform_sample_counts(ps3, function(x) log(1 + x))

# View how each function altered count data
par(mfrow=c(1,4))
plot(sort(sample_sums(ps3), TRUE), type = "o", main = "Native", ylab = "RSVs", xlab = "Samples")
plot(sort(sample_sums(ps3.log), TRUE), type = "o", main = "log Transfromed", ylab = "RSVs", xlab = "Samples")
plot(sort(sample_sums(ps3.ra), TRUE), type = "o", main = "Relative Abundance", ylab = "RSVs", xlab = "Samples")
plot(sort(sample_sums(ps3.prop), TRUE), type = "o", main = "Proportional Abundance", ylab = "RSVs", xlab = "Samples")

par(mfrow=c(1,1))

# Histograms of the non-transformed data vs. the transformed data can address the shift to normality
p.nolog <- qplot(rowSums(otu_table(ps3))) + ggtitle("Raw Counts") +
  theme_bw() +
  xlab("Row Sum") +
  ylab("# of Samples")

p.log <- qplot(log10(rowSums(otu_table(ps3)))) +
  ggtitle("log10 transformed counts") +
  theme_bw() +
  xlab("Row Sum") +
  ylab("# of Samples")

grid.arrange(p.nolog, p.log, ncol = 2)

##Subsetting (optional)

You will frequently find that you want to analyze a subset of your total data set. There are typically commands that will allow you to do this for each individual analysis, but similar to variable reordering it can sometime be more convenient to do this towards the beginning of your analysis. This should be done after removal of outlier samples and taxa. If you wish to create transformed versions of each subset you can either subset the transformed data you just generated, or alternatively retransform your subsetted data. The R chunk below is an example subsetting of the example data by treatment.

Subsetting away samples can create a situation where taxa are present as empty rows. This is because not every sample has every taxa. These can be removed as shown in the R chunk below.

Creating individual subsets like this can be particularly useful when assessing differential abundance using DeSeq2.

# Make a subset of mice treated with Ampicillin
ps3 # Check the original number of samples
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 412 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 412 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 412 tips and 410 internal nodes ]
ps3.amp <- subset_samples(ps3, Treatment == "Ampicillin")
ps3.amp # Check that the number of samples is correct following the subsetting
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 412 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 412 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 412 tips and 410 internal nodes ]
any(taxa_sums(ps3.amp) == 0) # In this case it is TRUE, so remove the zero's
## [1] TRUE
ps3.amp <- prune_taxa(taxa_sums(ps3.amp) > 0, ps3.amp)
ps3.amp # Number of taxa should be smaller now
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 359 taxa and 26 samples ]
## sample_data() Sample Data:       [ 26 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 359 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 359 tips and 357 internal nodes ]
any(taxa_sums(ps3.amp) == 0) # It should now be false
## [1] FALSE
# Make a subset of mice treated with Vehicle
ps3 # Check the original number of samples
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 412 taxa and 40 samples ]
## sample_data() Sample Data:       [ 40 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 412 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 412 tips and 410 internal nodes ]
ps3.vehicle <- subset_samples(ps3, Treatment == "Vehicle")
ps3.vehicle # Check that the number of svehicleles is correct following the subsetting
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 412 taxa and 14 samples ]
## sample_data() Sample Data:       [ 14 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 412 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 412 tips and 410 internal nodes ]
any(taxa_sums(ps3.vehicle) == 0) # In this case it is TRUE, so remove the zero's
## [1] TRUE
ps3.vehicle <- prune_taxa(taxa_sums(ps3.vehicle) > 0, ps3.vehicle)
ps3.vehicle # Number of taxa should be smaller now
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 341 taxa and 14 samples ]
## sample_data() Sample Data:       [ 14 samples by 26 sample variables ]
## tax_table()   Taxonomy Table:    [ 341 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 341 tips and 339 internal nodes ]
any(taxa_sums(ps3.vehicle) == 0) # It should now be false
## [1] FALSE

Community composition plotting

Bar plots can be useful for examining overall taxon representation across your samples. Importantly, they do not portray any information about the overall abundance of each taxa (just relative abundance) and should not be usef for making anything more than high-level visual representations of the community composition of your data.

# Create a data table for ggploting
ps3_phylum <- ps3 %>%
  tax_glom(taxrank = "Phylum") %>%                     # agglomerate at phylum level
  transform_sample_counts(function(x) {x/sum(x)} ) %>% # Transform to rel. abundance (or use ps0.ra)
  psmelt() %>%                                         # Melt to long format for easy ggploting
  filter(Abundance > 0.01)                             # Filter out low abundance taxa

# Plot - Phylum
p.ra.phylum <- ggplot(ps3_phylum, aes(x = Name, y = Abundance, fill = Phylum)) + 
  geom_bar(stat = "identity", width = 1) +
  facet_grid(~Treatment, scales = "free_x") +
  theme(axis.text.x = element_blank()) +
  theme(axis.title.x = element_blank()) +
  ggtitle("Abundant Phylum (> 1%)")
p.ra.phylum

# Note: This is a nice place to output tables of data that you may want to use for other analysis, or to include as supplemental data for publication
# You can rerun the first bit of code in this chunk and change Phylum to Species for a table with all possible classifications
write.table(ps3_phylum, file = "phylum_relab.txt", sep = "\t")

ggplotly(p.ra.phylum)

These plots show that Ampicillin treatment changes the ratio of Bacteroidetes to Firmicutes and results in the expansion of Tenericutes and Proteobacteria.

Alpha diversity analysis

Alpha diversity is diversity of species on a local scale such as a specific environmental sampling site or patient sample. Species diversity consists of two componentsL species richness and species evenness. Species richness is simply the count of species at a site while species evenness takes into account there abundances. Equal abundances indicates high evenness and low diversity, unequal abundnaces indicates low evenness and high diversity.

The basic properties can be useful for determining high-level differences between study groups. Importantly, they do not take into account any taxonomic information, so any differences you see require further investigations into what taxa are driving differences in alpha diversity.

To test for differences in alpha diversity we will compare the mean diversity of different diversity measures between groups using a t-test and use box or violin plots for display. The R chunk below implements two measures: 1) Observed which is the observed number of species, or richness, 2) Shannon Diversity which is a standard diversity measure. Other measures are available in estimate_richness options.

Alpha diversity summary information generation

# Diversity
diversity <- global(ps3)
head(diversity)
##            richness_0 richness_20 richness_50 richness_80
## 806rcbc333        236         236         236         187
## 806rcbc334        246         246         246         174
## 806rcbc335        255         255         255         213
## 806rcbc336        222         222         222         148
## 806rcbc337        199         199         199         128
## 806rcbc338        234         234         234         178
##            diversities_inverse_simpson diversities_gini_simpson
## 806rcbc333                   11.139720                0.9102311
## 806rcbc334                    7.368930                0.8642951
## 806rcbc335                   11.768307                0.9150260
## 806rcbc336                   10.220625                0.9021586
## 806rcbc337                    5.157725                0.8061161
## 806rcbc338                    8.060761                0.8759422
##            diversities_shannon diversities_fisher diversities_coverage
## 806rcbc333            3.275649           32.59279                    4
## 806rcbc334            2.906675           34.90328                    3
## 806rcbc335            3.426973           35.42800                    4
## 806rcbc336            2.963711           31.89358                    4
## 806rcbc337            2.292101           27.78067                    2
## 806rcbc338            3.036928           34.16407                    3
##            evenness_camargo evenness_pielou evenness_simpson evenness_evar
## 806rcbc333       0.06699647       0.5995149       0.02703816     0.1738477
## 806rcbc334       0.05223827       0.5279745       0.01788575     0.1742576
## 806rcbc335       0.07982035       0.6184461       0.02856385     0.1801690
## 806rcbc336       0.04564875       0.5485633       0.02480734     0.1689148
## 806rcbc337       0.02749613       0.4330189       0.01251875     0.1795631
## 806rcbc338       0.06114635       0.5566909       0.01956495     0.1846712
##            evenness_bulla dominance_DBP dominance_DMN dominance_absolute
## 806rcbc333      0.1874398     0.1989041     0.3585072               9039
## 806rcbc334      0.1613208     0.2547290     0.4956511              10221
## 806rcbc335      0.2112214     0.2093259     0.3589380               9903
## 806rcbc336      0.1373706     0.1604751     0.3085372               5391
## 806rcbc337      0.1044051     0.2970703     0.5764509              10647
## 806rcbc338      0.1839894     0.2795278     0.4198198               8998
##            dominance_relative dominance_simpson dominance_core_abundance
## 806rcbc333          0.1989041        0.08976886                0.6612974
## 806rcbc334          0.2547290        0.13570492                0.8227040
## 806rcbc335          0.2093259        0.08497399                0.6436196
## 806rcbc336          0.1604751        0.09784137                0.8071084
## 806rcbc337          0.2970703        0.19388393                0.9112165
## 806rcbc338          0.2795278        0.12405776                0.7901522
##            dominance_gini rarity_log_modulo_skewness rarity_low_abundance
## 806rcbc333      0.9330035                   2.059167           0.08064871
## 806rcbc334      0.9477617                   2.060759           0.07170093
## 806rcbc335      0.9201797                   2.058312           0.08201399
## 806rcbc336      0.9543512                   2.059651           0.06435673
## 806rcbc337      0.9725039                   2.060607           0.05094866
## 806rcbc338      0.9388537                   2.061208           0.07555141
##            rarity_rare_abundance
## 806rcbc333            0.25325676
## 806rcbc334            0.13201246
## 806rcbc335            0.29823078
## 806rcbc336            0.16895874
## 806rcbc337            0.06819196
## 806rcbc338            0.16321839
sd.3 <- as.data.frame(sample_data(ps3)) # useful to coerce the phyloseq object into an R data frame
ps3.rich <- merge(sd.3, diversity, by ="row.names") # merge sd.1 by row names

# Add divergence measurements
ps3.rich$divergence <- divergence(ps3)
p.rich.treatment <- ggboxplot(ps3.rich, x = "Treatment", y = "richness_0", outlier.shape = NA) +
  geom_jitter(width = 0.2) +
  # facet_grid(~Week) +
  ylab("Richness") +
  theme(axis.title.x = element_blank()) +
  stat_compare_means(method = "t.test")

p.sd.treatment <- ggboxplot(ps3.rich, x = "Treatment", y = "diversities_shannon", outlier.shape = NA) +
  geom_jitter(width = 0.2) +
  # facet_grid(~Week) +
  ylab("Shannon diversity") +
  theme(axis.title.x = element_blank()) +
  stat_compare_means(method = "t.test")

grid.arrange(p.rich.treatment, p.sd.treatment, ncol = 2)

Analysis of alpha diversity along continuous variables can be visualized using visually weighted regression plots (http://www.fight-entropy.com/2012/07/visually-weighted-regression.html).

# The plot_regression command from the microbiome package requires phyloseq objects
# For the violin plots above we combined alpha diversity measurements with sample data as a data frame to generate ggplot ready data
# We will take the same approach here, but will combine alpha diversity measurements with sample data in the phyloseq object to prepare it for microbiome

# Diversity
diversity.vehicle <- global(ps3.vehicle)
diversity.amp <- global(ps3.amp)
head(diversity.amp)
##            richness_0 richness_20 richness_50 richness_80
## 806rcbc349         64          64          64          36
## 806rcbc350         91          91          91          46
## 806rcbc351         76          76          76          45
## 806rcbc352         80          80          80          48
## 806rcbc353        123         123         123          77
## 806rcbc354        181         181         181         113
##            diversities_inverse_simpson diversities_gini_simpson
## 806rcbc349                   15.381904                0.9349885
## 806rcbc350                    7.507538                0.8668005
## 806rcbc351                    8.357182                0.8803424
## 806rcbc352                   18.134298                0.9448559
## 806rcbc353                   14.146269                0.9293100
## 806rcbc354                   23.785381                0.9579574
##            diversities_shannon diversities_fisher diversities_coverage
## 806rcbc349            3.320775           23.22971                    7
## 806rcbc350            3.004054           28.33299                    3
## 806rcbc351            3.080778           24.74539                    4
## 806rcbc352            3.573072           30.11120                    8
## 806rcbc353            3.415539           35.53045                    5
## 806rcbc354            3.914249           49.81828                    8
##            evenness_camargo evenness_pielou evenness_simpson evenness_evar
## 806rcbc349       0.06429491       0.7984777       0.04284653     0.4510124
## 806rcbc350       0.06192097       0.6659605       0.02091236     0.4124210
## 806rcbc351       0.06107885       0.7113756       0.02327906     0.4385144
## 806rcbc352       0.08464755       0.8153925       0.05051336     0.4906040
## 806rcbc353       0.08254724       0.7097689       0.03940465     0.3828727
## 806rcbc354       0.12645936       0.7529578       0.06625454     0.3508831
##            evenness_bulla dominance_DBP dominance_DMN dominance_absolute
## 806rcbc349      0.1759777    0.18421053     0.2748538                 63
## 806rcbc350      0.1925512    0.31259259     0.4711111                211
## 806rcbc351      0.1839789    0.30844794     0.4066798                157
## 806rcbc352      0.2117094    0.16290727     0.2756892                 65
## 806rcbc353      0.2248285    0.14949863     0.2816773                164
## 806rcbc354      0.2778637    0.09645777     0.1874659                177
##            dominance_relative dominance_simpson dominance_core_abundance
## 806rcbc349         0.18421053        0.06501146                0.8654971
## 806rcbc350         0.31259259        0.13319945                0.8622222
## 806rcbc351         0.30844794        0.11965756                0.8644401
## 806rcbc352         0.16290727        0.05514413                0.8320802
## 806rcbc353         0.14949863        0.07069001                0.8559708
## 806rcbc354         0.09645777        0.04204263                0.6850136
##            dominance_gini rarity_log_modulo_skewness rarity_low_abundance
## 806rcbc349      0.9357051                   2.037462           0.00000000
## 806rcbc350      0.9380790                   2.050175           0.06666667
## 806rcbc351      0.9389211                   2.049303           0.06090373
## 806rcbc352      0.9153524                   2.039103           0.00000000
## 806rcbc353      0.9174528                   2.054327           0.09662716
## 806rcbc354      0.8735406                   2.044635           0.10354223
##            rarity_rare_abundance
## 806rcbc349            0.07894737
## 806rcbc350            0.10222222
## 806rcbc351            0.09823183
## 806rcbc352            0.12531328
## 806rcbc353            0.09845032
## 806rcbc354            0.26049046
# Combine diversity into metadata
sample_data(ps3.vehicle)$Shannon <- diversity.vehicle$diversities_shannon
sample_data(ps3.vehicle)$Richness <- diversity.vehicle$richness_0

sample_data(ps3.amp)$Shannon <- diversity.amp$diversities_shannon
sample_data(ps3.amp)$Richness <- diversity.amp$richness_0

colnames(sample_data(ps3.vehicle))
##  [1] "X.SampleID"           "NumberReads"          "BarcodeSequence"     
##  [4] "LinkerPrimerSequence" "Plate"                "Well"                
##  [7] "Name"                 "SampleNo"             "DNAPlateNumber"      
## [10] "Cell"                 "Experiment"           "Microbiota"          
## [13] "Sex"                  "GroupedCage"          "IndividualCage"      
## [16] "DayPostTreatment"     "Treatment"            "Virus"               
## [19] "DayPostInoculation"   "SurvivalStatus"       "DaysToDeath"         
## [22] "SampleCollected"      "SampesProcessed"      "Description"         
## [25] "continuous_var_1"     "continuous_var_2"     "Shannon"             
## [28] "Richness"
colnames(sample_data(ps3.amp))
##  [1] "X.SampleID"           "NumberReads"          "BarcodeSequence"     
##  [4] "LinkerPrimerSequence" "Plate"                "Well"                
##  [7] "Name"                 "SampleNo"             "DNAPlateNumber"      
## [10] "Cell"                 "Experiment"           "Microbiota"          
## [13] "Sex"                  "GroupedCage"          "IndividualCage"      
## [16] "DayPostTreatment"     "Treatment"            "Virus"               
## [19] "DayPostInoculation"   "SurvivalStatus"       "DaysToDeath"         
## [22] "SampleCollected"      "SampesProcessed"      "Description"         
## [25] "continuous_var_1"     "continuous_var_2"     "Shannon"             
## [28] "Richness"
p.regr.veh <- plot_regression(Shannon ~ Richness, meta(ps3.vehicle), spag = TRUE) +
  ggtitle("Vehicle Control Mice") +
  ylab("Shannon Diversity") +
  xlab("Richness")

p.regr.amp <- plot_regression(Shannon ~ Richness, meta(ps3.amp), spag = TRUE) +
  ggtitle("Ampicillin Treated Mice") +
  ylab("Shannon Diversity") +
  xlab("Richness")

grid.arrange(p.regr.veh, p.regr.amp, nrow = 2)

Correlation between measures of alpha diversity and continuous variables can be assessed using the cor.test function of base R and the corrplot package for measuring multiple variables at once.

cor.vehicle <- cor(diversity.vehicle)
cor.amp <- cor(diversity.amp)

corrplot(cor.vehicle)

corrplot(cor.amp)

# If you wanted to compare with data with sample metadata you can merge the entire diversity data frame to your sample data
ps3.div.vehicle <- merge(sample_data(ps3.vehicle), diversity.vehicle, by = "row.names")
ps3.div.amp <- merge(sample_data(ps3.amp), diversity.amp, by = "row.names")

# Remove non-numeric variables
ps3.div.vehicle <- select_if(ps3.div.vehicle, is.numeric)
ps3.div.amp <- select_if(ps3.div.amp, is.numeric)

# Remove uninteresting variables, in this case the first 5 variables are not needed
ps3.div.vehicle <- subset(ps3.div.vehicle, select = -c(1:5))
ps3.div.amp <- subset(ps3.div.amp, select = -c(1:5))

# Run correlations
ps3.div.v.cor <- cor(ps3.div.vehicle)
ps3.div.a.cor <- cor(ps3.div.amp)

# Correlation plots
corrplot(ps3.div.v.cor)

corrplot(ps3.div.a.cor)

These results demonstrate how the average number of bacteria detected in ampicillin treated mice decreases as expected following antibiotic treatment. The diversity increases which is also reflected through the increase in Tenericutes and Proteobacteria in the community composition bar plots.

Beta diversity analysis

Beta diversity is the extent of difference in community composition between sites or along a gradient. There are a large number of ways to analyze and visualize beta diversity, but we are basically looking for a ordination to help us determine how different samples from different groups are. For example, we want to project the diversity of each sample as a point across an ordination plot that explains as as much variance in the data as possible and in an unconstrained manner. We then color each sample by a grouping variable (e.g. Treatment, Infection, Disease, Sex) and look for distinct grouping.

One can run statistical tests to determine if the null hypothesis that the groups are not dissimilar should be rejected at a given alpha.

This first R chunk will implement two ordinations using two commonly used phylogenetic diversity measures, unifrac and weighted unifrac. Weighted unifrac takes into account the relative abundnaces of each taxa, while unifrac does not. Both measures take into account the phylogenetic relatedness of each taxa. There are a large number of measures that can be interrogated.

Key factors that one may want to consdier when approaching a beta diversity analysis are:

  1. Transformed vs. untransformed data
  2. Which data transform to use
  3. Distance measure to use
# Ordinate
ord.pcoa.uni <- ordinate(ps3, method = "PCoA", distance = "unifrac")
ord.pcoa.wuni <- ordinate(ps3, method = "PCoA", distance = "wunifrac")

# Plots
p.pcoa.uni <- plot_ordination(ps3, ord.pcoa.uni, color = "GroupedCage") +
  geom_point(size=5, alpha = 0.7) +
  geom_point(colour = "grey50", size = 1.5) +
  facet_grid(~ Treatment) +
  stat_ellipse(type = "norm") +
  ggtitle("PCoA | Unifrac")

p.pcoa.wuni <- plot_ordination(ps3, ord.pcoa.wuni, color = "SurvivalStatus") +
  geom_point(size=5, alpha = 0.7) +
  geom_point(colour = "grey50", size = 1.5) +
  facet_grid(~ Treatment) +
  stat_ellipse(type = "norm") +
  ggtitle("PCoA | wUnifrac")

grid.arrange(p.pcoa.uni, p.pcoa.wuni, nrow = 2)

For fun the points are color coded by the cage each animal originate from or survival status . The take home message from these plots is that the samples are separated by treatment, and there isnt much of any separation by cage. This is to be expected from the experimental design. The ADONIS test can be used to test if the treatment groups are significantly different.

Group significance testing with ADONIS

# Set a random seed so that exact results can be reproduced
set.seed(1000)

# Function to run adonis test on a physeq object and a variable from metadata 
doadonis <- function(physeq, category) {
  bdist <- phyloseq::distance(physeq, "bray")
  col <- as(sample_data(physeq), "data.frame")[ ,category]
  
  # Adonis test
  adonis.bdist <- adonis(bdist ~ col)
  print("Adonis results:")
  print(adonis.bdist)
  
  # Homogeneity of dispersion test
  betatax = betadisper(bdist,col)
  p = permutest(betatax)
  print("Betadisper results:")
  print(p$tab)
}

# Runs Permanov and Homogeneity of dispersion test
# See ?betadisper for more info
doadonis(ps0, "Treatment")
## [1] "Adonis results:"
## 
## Call:
## adonis(formula = bdist ~ col) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## col        1    5.2961  5.2961  29.122 0.40378  0.001 ***
## Residuals 43    7.8201  0.1819         0.59622           
## Total     44   13.1162                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Betadisper results:"
##           Df     Sum Sq    Mean Sq        F N.Perm Pr(>F)
## Groups     1 0.07064695 0.07064695 2.394681    999  0.132
## Residuals 43 1.26856946 0.02950162       NA     NA     NA

You can see from the highly significant p-value that treatment does indeed indicate that the groups are dissimilar. The Homogeneity of Dispersion is a test for how variable the sample groups are. For this data, the significant p-value suggets that the variance is significantly different between the groups as well.

The second doadonis command tests whether or not mice that survived or died as a result of treatment were dissimilar. This example was included as it demonstrates the utility of subsetting as we only wanted to run the test on ampicillin treated samples.

Differential abundance testing

# Differential Abundance - Baseline ~ CD4 and VL Category
ds <- phyloseq_to_deseq2(ps0, ~Treatment) # Change formula to VL_category and rerun to test VL_category
dds <- DESeq(ds, test="Wald", fitType="local", betaPrior = FALSE) # Worth trying parametric and local fitTypes

alpha = 0.05

# Tabulate and write results
res.dds = results(dds, cooksCutoff = FALSE)
sigtab_dds = res.dds[which(res.dds$padj < alpha), ]
sigtab_dds = cbind(as(sigtab_dds, "data.frame"), as(tax_table(ps0)[rownames(sigtab_dds), ], "matrix"))
head(sigtab_dds)
##                                                                                                                                                                                                                                                           baseMean
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA 327.03171
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  235.35751
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA 100.85545
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA  38.55688
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA 481.24391
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC  101.87178
##                                                                                                                                                                                                                                                          log2FoldChange
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA      -1.677454
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC       -1.758224
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA      -1.042287
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA      -1.760693
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA       4.830774
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC        1.668447
##                                                                                                                                                                                                                                                              lfcSE
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA 0.3647439
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  0.3972025
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA 0.3449343
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA 0.3789173
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA 0.6194682
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC  0.4579249
##                                                                                                                                                                                                                                                               stat
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA -4.598992
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  -4.426519
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA -3.021696
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA -4.646642
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA  7.798261
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC   3.643494
##                                                                                                                                                                                                                                                                pvalue
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA 4.245399e-06
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  9.576597e-06
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA 2.513628e-03
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA 3.373819e-06
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA 6.276629e-15
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC  2.689617e-04
##                                                                                                                                                                                                                                                                  padj
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA 2.843131e-05
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  6.224788e-05
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA 9.415452e-03
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA 2.330044e-05
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA 7.300710e-14
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC  1.350921e-03
##                                                                                                                                                                                                                                                              Kingdom
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA k__Bacteria
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  k__Bacteria
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA k__Bacteria
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA k__Bacteria
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA k__Bacteria
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC  k__Bacteria
##                                                                                                                                                                                                                                                                    Phylum
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA p__Bacteroidetes
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  p__Bacteroidetes
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA p__Bacteroidetes
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA p__Bacteroidetes
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA   p__Tenericutes
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC     p__Firmicutes
##                                                                                                                                                                                                                                                                   Class
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA c__Bacteroidia
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC  c__Bacteroidia
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA c__Bacteroidia
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA c__Bacteroidia
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA  c__Mollicutes
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC      c__Bacilli
##                                                                                                                                                                                                                                                                         Order
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA     o__Bacteroidales
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC      o__Bacteroidales
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA     o__Bacteroidales
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA     o__Bacteroidales
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA o__Anaeroplasmatales
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC    o__Lactobacillales
##                                                                                                                                                                                                                                                                         Family
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA              f__S24-7
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC       f__Rikenellaceae
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA              f__S24-7
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA              f__S24-7
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA f__Anaeroplasmataceae
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC    f__Lactobacillaceae
##                                                                                                                                                                                                                                                                     Genus
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA              g__
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC               g__
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA              g__
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA              g__
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA  g__Anaeroplasma
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC  g__Lactobacillus
##                                                                                                                                                                                                                                                          Species
## CGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACA     s__
## GGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCGGATAAGTTAGAGGTGAAATCCCGAGGCTCAACTTCGGAATTGCCTCTGATACTGTTCGGCTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAGCTACTTCTGACGTTGAGGCACGAAAGCGTGGGGAGCAAAC      s__
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCAGACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA     s__
## GGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTCCGTTAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTCGAGCCGTTGAAACTGGCAGACTTGAGTTGGCGAGAAGTACGCGGAATGCGCGGTGTAGCGGTGAAATGCATAGATATCGCGCAGAACTCCGATTGCGAAGGCAGCGTACCGGCGCCACACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACA     s__
## CGTAGGGAGCGAGCGTTATCCGGAATTATTGGGCGTAAAGGGTGCGTAGATGGCATAGTAAGTCTTTTGTAAAAATGCTGGGCTCAACCCAGTAGGGCAAAAGATACTGCAAAGCTAGAGTATGACAGAGGCAAGTGGAACTACATGTGTAGCGGTAAAATGCGTAAATATATGTAAGAACACCAGTGGCGAAGGCGGCTTGCTGGGTCGATACTGACATTGAGGCACGAAAGCGTGGGGAGCAAACA     s__
## GTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAAC      s__
write.table(sigtab_dds, file="./results/deseq_results.txt", sep = "\t") # Change filename to save results to appropriate file

# Simple plot of results
x.ds = tapply(sigtab_dds$log2FoldChange, sigtab_dds$Family, function(x) max(x))
x.ds = sort(x.ds, TRUE)
sigtab_dds$Species = factor(as.character(sigtab_dds$Family), levels=names(x.ds))
p.deseq <- ggplot(sigtab_dds, aes(x=Family, y=log2FoldChange, color=Genus)) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point(size=5, alpha = 0.7) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
  ggtitle("Taxa Differentially Associated by Treatment") + 
  geom_hline(yintercept = 0, lwd=1.5)
p.deseq

# Quick check of factor levels
mcols(res.dds, use.names = TRUE)
## DataFrame with 6 rows and 2 columns
##                        type
##                 <character>
## baseMean       intermediate
## log2FoldChange      results
## lfcSE               results
## stat                results
## pvalue              results
## padj                results
##                                                            description
##                                                            <character>
## baseMean                     mean of normalized counts for all samples
## log2FoldChange log2 fold change (MLE): Treatment Ampicillin vs Vehicle
## lfcSE                  standard error: Treatment Ampicillin vs Vehicle
## stat                   Wald statistic: Treatment Ampicillin vs Vehicle
## pvalue              Wald test p-value: Treatment Ampicillin vs Vehicle
## padj                                              BH adjusted p-values

Volcano plot of differential abundance testing results

Volcano plots are useful for visualizing not only the differentially abundnat taxa, but those taxa which did not pass the alpha threshold. Interactive plots using plotly can aid in interrogating the complex arrangements of points.

# Prepare to join results data.table and taxonomy table
resdt = data.table(as(results(dds, cooksCutoff = FALSE), "data.frame"),
                   keep.rownames = TRUE)
setnames(resdt, "rn", "OTU")
taxdt = data.table(data.frame(as(tax_table(ps0), "matrix")), keep.rownames = TRUE)
setnames(taxdt, "rn", "OTU")

# Join results data.table and taxonomy table
setkeyv(taxdt, "OTU")
setkeyv(resdt, "OTU")
resdt <- taxdt[resdt]
resdt
##                                                                                                                                                                                                                                                            OTU
##   1: ACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGTGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGCAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGAGTGCGAAAGCATGGGGAGCGAACGG
##   2: ACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGTAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACAG
##   3:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTCAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCATTGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCAAAT
##   4:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTTCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAAT
##   5:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTTCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGTGGTGAAAACTACTAAGCTAGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGATGAAATGCGTAGAGATCGGAAGGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAAT
##  ---                                                                                                                                                                                                                                                          
## 439:    TAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGTAGGCGGCATGGTAAGCCAGATGTGAAAGCCCGCGGCTTAACCGCGCGGATTGCATTTGGAACTATCAAGCTAGAGTACAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCAGTGGCGAAGGCGGCTTTCTGGACTGAAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAA
## 440:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGCAAGCCAGAAGTGAAAACCCGGGGCTTAACCCCGCGGATTGCTTTTGGAACTGTCAGGCTGGAGTGCAGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 441:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGTAAGCCAGAAGTGAAAACCCAGGGCTCAACTCTGTGGATTGCTTTTGGAACTATCAAGCTAGAGTGCTGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACAGAAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 442:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGTAAGCCAGAAGTGAAAACCCGGGGCTCAACCCCGCGGATTGCTTTTGGAACTATCAGGCTGGAGTGCTGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACAGAAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 443:    TAGGGGGCAAGCGTTGTTCGGAATGACTGGGCGTAAAGGGTGTGTAGGTGGTTTGATAAGTTAGATGTGTAATACCCAGGGCTTAACTCGGGTGCTGCATCTAAAACTGTTAGACTTGAGTACTGGAGAGGATAGTGGAATTCCTAGTGTAGCGGTGGAATGCGTAGATATTAGGAGGAACATCAGTGGCGAAGGCGACTATCTGGACAGCAACTGACACTGAGGCACGAAAGCGTGGGGAGCAAA
##          Kingdom            Phylum                  Class            Order
##   1: k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales
##   2: k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales
##   3: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##   4: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##   5: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##  ---                                                                      
## 439: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 440: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 441: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 442: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 443: k__Bacteria p__Proteobacteria                     NA               NA
##                  Family          Genus      Species    baseMean
##   1:    f__mitochondria     g__Lupinus    s__luteus   8.1539852
##   2:    f__mitochondria         g__Zea s__luxurians  74.2677521
##   3:                f__            g__          s__  20.9865616
##   4:                f__            g__          s__ 105.1752823
##   5:                f__            g__          s__   2.2078458
##  ---                                                           
## 439:                 NA             NA           NA   0.1974782
## 440:                f__            g__          s__   1.2531111
## 441: f__Lachnospiraceae g__Coprococcus          s__   0.4991015
## 442: f__Lachnospiraceae g__Coprococcus          s__   0.1049244
## 443:                 NA             NA           NA   0.7233974
##      log2FoldChange     lfcSE       stat       pvalue         padj
##   1:      9.3321835 0.8224754 11.3464590 7.722724e-30 4.876349e-28
##   2:      5.7485723 0.9156034  6.2784526 3.419592e-10 3.285781e-09
##   3:      3.8363204 0.9407839  4.0777914 4.546553e-05 2.543768e-04
##   4:      3.6737619 1.0525083  3.4904826 4.821490e-04 2.254025e-03
##   5:      7.5856134 0.9096882  8.3386964 7.511401e-17 9.764822e-16
##  ---                                                              
## 439:      0.9060222 0.8051730  1.1252516 2.604824e-01 3.725995e-01
## 440:      2.0331237 0.6033231  3.3698752 7.520224e-04 3.426741e-03
## 441:      0.2409729 0.6382957  0.3775254 7.057832e-01 7.645984e-01
## 442:      2.2740985 1.0726519  2.1200713 3.400003e-02 8.079578e-02
## 443:      1.5507372 1.2630825  1.2277403 2.195445e-01 3.291055e-01
resdt[, Significant := padj < alpha]
resdt[!is.na(Significant)]
##                                                                                                                                                                                                                                                            OTU
##   1: ACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGTGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGCAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGAGTGCGAAAGCATGGGGAGCGAACGG
##   2: ACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGTAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACAG
##   3:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTCAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCATTGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCAAAT
##   4:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTTCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAAT
##   5:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTTCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGTGGTGAAAACTACTAAGCTAGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGATGAAATGCGTAGAGATCGGAAGGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAAT
##  ---                                                                                                                                                                                                                                                          
## 438:    TAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGTAGGCGGCATGGTAAGCCAGATGTGAAAGCCCGCGGCTTAACCGCGCGGATTGCATTTGGAACTATCAAGCTAGAGTACAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCAGTGGCGAAGGCGGCTTTCTGGACTGAAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAA
## 439:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGCAAGCCAGAAGTGAAAACCCGGGGCTTAACCCCGCGGATTGCTTTTGGAACTGTCAGGCTGGAGTGCAGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 440:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGTAAGCCAGAAGTGAAAACCCAGGGCTCAACTCTGTGGATTGCTTTTGGAACTATCAAGCTAGAGTGCTGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACAGAAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 441:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGTAAGCCAGAAGTGAAAACCCGGGGCTCAACCCCGCGGATTGCTTTTGGAACTATCAGGCTGGAGTGCTGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACAGAAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 442:    TAGGGGGCAAGCGTTGTTCGGAATGACTGGGCGTAAAGGGTGTGTAGGTGGTTTGATAAGTTAGATGTGTAATACCCAGGGCTTAACTCGGGTGCTGCATCTAAAACTGTTAGACTTGAGTACTGGAGAGGATAGTGGAATTCCTAGTGTAGCGGTGGAATGCGTAGATATTAGGAGGAACATCAGTGGCGAAGGCGACTATCTGGACAGCAACTGACACTGAGGCACGAAAGCGTGGGGAGCAAA
##          Kingdom            Phylum                  Class            Order
##   1: k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales
##   2: k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales
##   3: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##   4: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##   5: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##  ---                                                                      
## 438: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 439: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 440: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 441: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 442: k__Bacteria p__Proteobacteria                     NA               NA
##                  Family          Genus      Species    baseMean
##   1:    f__mitochondria     g__Lupinus    s__luteus   8.1539852
##   2:    f__mitochondria         g__Zea s__luxurians  74.2677521
##   3:                f__            g__          s__  20.9865616
##   4:                f__            g__          s__ 105.1752823
##   5:                f__            g__          s__   2.2078458
##  ---                                                           
## 438:                 NA             NA           NA   0.1974782
## 439:                f__            g__          s__   1.2531111
## 440: f__Lachnospiraceae g__Coprococcus          s__   0.4991015
## 441: f__Lachnospiraceae g__Coprococcus          s__   0.1049244
## 442:                 NA             NA           NA   0.7233974
##      log2FoldChange     lfcSE       stat       pvalue         padj
##   1:      9.3321835 0.8224754 11.3464590 7.722724e-30 4.876349e-28
##   2:      5.7485723 0.9156034  6.2784526 3.419592e-10 3.285781e-09
##   3:      3.8363204 0.9407839  4.0777914 4.546553e-05 2.543768e-04
##   4:      3.6737619 1.0525083  3.4904826 4.821490e-04 2.254025e-03
##   5:      7.5856134 0.9096882  8.3386964 7.511401e-17 9.764822e-16
##  ---                                                              
## 438:      0.9060222 0.8051730  1.1252516 2.604824e-01 3.725995e-01
## 439:      2.0331237 0.6033231  3.3698752 7.520224e-04 3.426741e-03
## 440:      0.2409729 0.6382957  0.3775254 7.057832e-01 7.645984e-01
## 441:      2.2740985 1.0726519  2.1200713 3.400003e-02 8.079578e-02
## 442:      1.5507372 1.2630825  1.2277403 2.195445e-01 3.291055e-01
##      Significant
##   1:        TRUE
##   2:        TRUE
##   3:        TRUE
##   4:        TRUE
##   5:        TRUE
##  ---            
## 438:       FALSE
## 439:        TRUE
## 440:       FALSE
## 441:       FALSE
## 442:       FALSE
resdt
##                                                                                                                                                                                                                                                            OTU
##   1: ACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGTGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGCAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGAGTGCGAAAGCATGGGGAGCGAACGG
##   2: ACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAAGTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGTAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACAG
##   3:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTCAAGTCCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCATTGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCAAAT
##   4:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTTCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGCGGTGGAAACTACCAAGCTGGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGGTGAAATGCGTAGAGATCGGAAAGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAAT
##   5:   AGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCTGTAGGTGGCTTTTTAAGTTCGCCGTCAAATCCCAGGGCTCAACCCTGGACAGGTGGTGAAAACTACTAAGCTAGAGTACGGTAGGGGCAGAGGGAATTTCCGGTGGAGCGATGAAATGCGTAGAGATCGGAAGGAACACCAACGGCGAAAGCACTCTGCTGGGCCGACACTGACACTGAGAGACGAAAGCTAGGGGAGCGAAT
##  ---                                                                                                                                                                                                                                                          
## 439:    TAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGTAGGCGGCATGGTAAGCCAGATGTGAAAGCCCGCGGCTTAACCGCGCGGATTGCATTTGGAACTATCAAGCTAGAGTACAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAAGAACACCAGTGGCGAAGGCGGCTTTCTGGACTGAAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAA
## 440:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGCAAGCCAGAAGTGAAAACCCGGGGCTTAACCCCGCGGATTGCTTTTGGAACTGTCAGGCTGGAGTGCAGGAGGGGCAGGCGGAATTCCTGGTGTAGCGGTGAAATGCGTAGATATCAGGAGGAACACCGGTGGCGAAGGCGGCCTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 441:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGTAAGCCAGAAGTGAAAACCCAGGGCTCAACTCTGTGGATTGCTTTTGGAACTATCAAGCTAGAGTGCTGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACAGAAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 442:    TAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGGTGGCATGGTAAGCCAGAAGTGAAAACCCGGGGCTCAACCCCGCGGATTGCTTTTGGAACTATCAGGCTGGAGTGCTGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACAGAAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAA
## 443:    TAGGGGGCAAGCGTTGTTCGGAATGACTGGGCGTAAAGGGTGTGTAGGTGGTTTGATAAGTTAGATGTGTAATACCCAGGGCTTAACTCGGGTGCTGCATCTAAAACTGTTAGACTTGAGTACTGGAGAGGATAGTGGAATTCCTAGTGTAGCGGTGGAATGCGTAGATATTAGGAGGAACATCAGTGGCGAAGGCGACTATCTGGACAGCAACTGACACTGAGGCACGAAAGCGTGGGGAGCAAA
##          Kingdom            Phylum                  Class            Order
##   1: k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales
##   2: k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales
##   3: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##   4: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##   5: k__Bacteria  p__Cyanobacteria         c__Chloroplast  o__Streptophyta
##  ---                                                                      
## 439: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 440: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 441: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 442: k__Bacteria     p__Firmicutes          c__Clostridia o__Clostridiales
## 443: k__Bacteria p__Proteobacteria                     NA               NA
##                  Family          Genus      Species    baseMean
##   1:    f__mitochondria     g__Lupinus    s__luteus   8.1539852
##   2:    f__mitochondria         g__Zea s__luxurians  74.2677521
##   3:                f__            g__          s__  20.9865616
##   4:                f__            g__          s__ 105.1752823
##   5:                f__            g__          s__   2.2078458
##  ---                                                           
## 439:                 NA             NA           NA   0.1974782
## 440:                f__            g__          s__   1.2531111
## 441: f__Lachnospiraceae g__Coprococcus          s__   0.4991015
## 442: f__Lachnospiraceae g__Coprococcus          s__   0.1049244
## 443:                 NA             NA           NA   0.7233974
##      log2FoldChange     lfcSE       stat       pvalue         padj
##   1:      9.3321835 0.8224754 11.3464590 7.722724e-30 4.876349e-28
##   2:      5.7485723 0.9156034  6.2784526 3.419592e-10 3.285781e-09
##   3:      3.8363204 0.9407839  4.0777914 4.546553e-05 2.543768e-04
##   4:      3.6737619 1.0525083  3.4904826 4.821490e-04 2.254025e-03
##   5:      7.5856134 0.9096882  8.3386964 7.511401e-17 9.764822e-16
##  ---                                                              
## 439:      0.9060222 0.8051730  1.1252516 2.604824e-01 3.725995e-01
## 440:      2.0331237 0.6033231  3.3698752 7.520224e-04 3.426741e-03
## 441:      0.2409729 0.6382957  0.3775254 7.057832e-01 7.645984e-01
## 442:      2.2740985 1.0726519  2.1200713 3.400003e-02 8.079578e-02
## 443:      1.5507372 1.2630825  1.2277403 2.195445e-01 3.291055e-01
##      Significant
##   1:        TRUE
##   2:        TRUE
##   3:        TRUE
##   4:        TRUE
##   5:        TRUE
##  ---            
## 439:       FALSE
## 440:        TRUE
## 441:       FALSE
## 442:       FALSE
## 443:       FALSE
volcano = ggplot(
  data = resdt[!is.na(Significant)][(pvalue < 1)],
  mapping = aes(x = log2FoldChange,
                y = -log10(pvalue),
                color = Phylum,
                label = OTU, label1 = Genus)) +
  theme_bw() +
  geom_point() + 
  geom_point(data = resdt[(Significant)], size = 7, alpha = 0.7) + 
  # geom_text(data = resdt[(Significant)], mapping = aes(label = paste("Genus:", Genus)), color = "black", size = 3) +
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
  geom_hline(yintercept = -log10(alpha)) +
  ggtitle("DESeq2 Negative Binomial Test Volcano Plot\nBaseline Samples") +
  theme(axis.title = element_text(size=12)) +
  theme(axis.text = element_text(size=12)) +
  theme(legend.text = element_text(size=12)) +
  geom_vline(xintercept = 0, lty = 2)
volcano

ggplotly(volcano)

##Ground truth plots

replace_counts = function(physeq, dds) {

  dds_counts = counts(dds, normalized = TRUE)
  if (!identical(taxa_names(physeq), rownames(dds_counts))) {
    stop("OTU ids don't match")
  }
  otu_table(physeq) = otu_table(dds_counts, taxa_are_rows = TRUE)
  return(physeq)

}

## Transform
# Convert raw counts to rlog normalized counts
ps3.rlog <- replace_counts(ps1, dds)

# o__Actinomycetales
ps3_Actinomycetales <- ps3.rlog %>%
    subset_taxa(Order == "o__Actinomycetales") %>%
    psmelt()

p.Actinomycetales <- ggboxplot(ps3_Actinomycetales, x = "Treatment", y = "Abundance", outlier.shape = NA, add = "median_iqr") +
  geom_jitter(mapping = aes(size = 2), alpha = 0.4, na.rm = TRUE, width = 0.3) +
  scale_y_log10() +
  ylab("Abundance (rlog)") +
  theme(axis.title.x = element_blank()) +
  theme(legend.position = "NULL") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_grid(~Genus) +
  stat_compare_means(method = "wilcox.test", label = "p.format")
p.Actinomycetales

# Subset deseq2 ID'd sequence
ps3_Actinomycetales.seq <- subset(ps3_Actinomycetales, OTU == "GTAGGGTGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAATCCCGAGGCTCAACCTCGGGCTTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAAC")
write.table(ps3_Actinomycetales.seq, file = "./results/ps1_Actinomycetales_seq.txt", sep = "\t")

p.Actinomycetales.seq <- ggboxplot(ps3_Actinomycetales.seq, x = "Treatment", y = "Abundance", outlier.shape = NA, add = "median_iqr") +
  geom_jitter(mapping = aes(size = 2), alpha = 0.4, na.rm = TRUE, width = 0.3) +
  scale_y_log10() +
  ylab("Abundance (rlog)") +
  theme(axis.title.x = element_blank()) +
  theme(legend.position = "NULL") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  stat_compare_means(method = "wilcox.test", label = "p.format")
p.Actinomycetales.seq

# Dsiplay current R session information
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2               corrplot_0.77             
##  [3] data.table_1.10.4          randomForest_4.6-12       
##  [5] ggpubr_0.1.5               magrittr_1.5              
##  [7] microbiome_0.99.93         plotly_4.7.1              
##  [9] DESeq2_1.16.1              SummarizedExperiment_1.6.3
## [11] DelayedArray_0.2.7         matrixStats_0.52.2        
## [13] Biobase_2.36.2             GenomicRanges_1.28.4      
## [15] GenomeInfoDb_1.12.2        IRanges_2.10.2            
## [17] S4Vectors_0.14.3           BiocGenerics_0.22.0       
## [19] knitr_1.17                 gridExtra_2.2.1           
## [21] vegan_2.4-3                lattice_0.20-35           
## [23] permute_0.9-4              RColorBrewer_1.1-2        
## [25] phyloseq_1.20.0            dplyr_0.7.2               
## [27] purrr_0.2.3                readr_1.1.1               
## [29] tidyr_0.7.0                tibble_1.3.4              
## [31] ggplot2_2.2.1.9000         tidyverse_1.1.1           
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.0.0            backports_1.1.0        
##   [3] Hmisc_4.0-3             plyr_1.8.4             
##   [5] igraph_1.1.2            lazyeval_0.2.0         
##   [7] splines_3.4.0           BiocParallel_1.10.1    
##   [9] crosstalk_1.0.0         robust_0.4-18          
##  [11] digest_0.6.12           foreach_1.4.3          
##  [13] htmltools_0.3.6         GO.db_3.4.1            
##  [15] checkmate_1.8.3         memoise_1.1.0          
##  [17] fit.models_0.5-14       cluster_2.0.6          
##  [19] doParallel_1.0.10       fastcluster_1.1.24     
##  [21] Biostrings_2.44.2       annotate_1.54.0        
##  [23] modelr_0.1.1            colorspace_1.3-2       
##  [25] blob_1.1.0              rvest_0.3.2            
##  [27] rrcov_1.4-3             haven_1.1.0            
##  [29] RCurl_1.95-4.8          jsonlite_1.5           
##  [31] genefilter_1.58.1       bindr_0.1              
##  [33] impute_1.50.1           survival_2.41-3        
##  [35] iterators_1.0.8         ape_4.1                
##  [37] glue_1.1.1              gtable_0.2.0           
##  [39] zlibbioc_1.22.0         XVector_0.16.0         
##  [41] DEoptimR_1.0-8          scales_0.4.1.9002      
##  [43] mvtnorm_1.0-6           DBI_0.7                
##  [45] Rcpp_0.12.12            viridisLite_0.2.0      
##  [47] xtable_1.8-2            htmlTable_1.9          
##  [49] foreign_0.8-69          bit_1.1-12             
##  [51] preprocessCore_1.38.1   Formula_1.2-2          
##  [53] htmlwidgets_0.9         httr_1.3.1             
##  [55] acepack_1.4.1           pkgconfig_2.0.1        
##  [57] XML_3.98-1.9            nnet_7.3-12            
##  [59] locfit_1.5-9.1          dynamicTreeCut_1.63-1  
##  [61] tidyselect_0.1.1        labeling_0.3           
##  [63] rlang_0.1.2             reshape2_1.4.2         
##  [65] AnnotationDbi_1.38.1    munsell_0.4.3          
##  [67] cellranger_1.1.0        tools_3.4.0            
##  [69] moments_0.14            RSQLite_2.0            
##  [71] ade4_1.7-8              broom_0.4.2            
##  [73] evaluate_0.10.1         biomformat_1.4.0       
##  [75] stringr_1.2.0           yaml_2.1.14            
##  [77] bit64_0.9-7             robustbase_0.92-7      
##  [79] nlme_3.1-131            mime_0.5               
##  [81] xml2_1.1.1              compiler_3.4.0         
##  [83] geneplotter_1.54.0      pcaPP_1.9-72           
##  [85] stringi_1.1.5           forcats_0.2.0          
##  [87] Matrix_1.2-11           psych_1.7.5            
##  [89] multtest_2.32.0         bitops_1.0-6           
##  [91] httpuv_1.3.5            R6_2.2.2               
##  [93] latticeExtra_0.6-28     codetools_0.2-15       
##  [95] MASS_7.3-47             assertthat_0.2.0       
##  [97] rhdf5_2.20.0            rprojroot_1.2          
##  [99] mnormt_1.5-5            GenomeInfoDbData_0.99.0
## [101] mgcv_1.8-18             hms_0.3                
## [103] grid_3.4.0              rpart_4.1-11           
## [105] rmarkdown_1.6           shiny_1.0.4            
## [107] WGCNA_1.61              lubridate_1.6.0        
## [109] base64enc_0.1-3